Decentralized algorithms are a form of computation that achieves a global objective through local dynamics relying on low-cost communication among directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, and sometimes superior, performance compared with distributed algorithms that use a central node. Recently, developing decentralized algorithms for deep learning has attracted great attention, since they are regarded as low-communication-overhead alternatives to methods based on a parameter server or the Ring-Allreduce protocol. However, the lack of an easy-to-use and efficient software package has kept most decentralized algorithms merely on paper. To fill this gap, we introduce BlueFog, a Python library for straightforward, high-performance implementations of diverse decentralized algorithms. Based on a unified abstraction of various communication operations, BlueFog offers intuitive interfaces to implement a spectrum of decentralized algorithms, from those using static undirected graphs for synchronous operations to those using dynamic and directed graphs for asynchronous operations. BlueFog also adopts several system-level acceleration techniques to further optimize its performance on deep learning tasks. On mainstream DNN training tasks, BlueFog reaches a much higher throughput and achieves an overall $1.2\times\sim 1.8\times$ speedup over a state-of-the-art distributed deep learning package based on Ring-Allreduce. BlueFog is open sourced at https://github.com/bluefog-lib/bluefog.
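As a rough illustration of the pattern such libraries implement, here is a plain NumPy sketch of decentralized gradient descent with neighbor averaging on a static ring graph; it does not use BlueFog's actual API, and the agent count, objective, and step size are assumptions chosen for the example.

```python
# Plain NumPy sketch of decentralized gradient descent with neighbor averaging
# on a static ring graph. This only illustrates the pattern such libraries
# implement; it does NOT use BlueFog's API. Agent count, objective, and step
# size are assumptions chosen for the example.
import numpy as np

n_agents, dim, lr, steps = 8, 4, 0.1, 200
rng = np.random.default_rng(0)

# Each agent i privately holds f_i(x) = 0.5 * ||x - b_i||^2.
b = rng.normal(size=(n_agents, dim))
x = np.zeros((n_agents, dim))        # one local copy of the variable per agent

# Doubly stochastic mixing matrix of a ring: average with the two neighbors.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        W[i, j % n_agents] = 1.0 / 3.0

for _ in range(steps):
    grads = x - b                    # local gradient of f_i at x_i
    x = W @ x - lr * grads           # mix with neighbors, then descend

# The network-average iterate converges to the global minimizer mean(b);
# individual copies stay within an O(step-size) neighborhood of it.
print(np.abs(x.mean(axis=0) - b.mean(axis=0)).max())
```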
As a challenging task, text-to-image generation aims to generate photo-realistic and semantically consistent images according to a given text description. Existing methods mainly extract the text information from a single sentence to represent an image, and this text representation strongly affects the quality of the generated image. However, directly utilizing the limited information in one sentence misses some key attribute descriptions, which are crucial factors for describing an image accurately. To alleviate this problem, we propose an effective text representation method that supplements the sentence with attribute information. First, we construct an attribute memory to jointly control text-to-image generation together with the sentence input. Second, we explore two update mechanisms, a sample-aware and a sample-joint mechanism, to dynamically optimize a generalized attribute memory. Furthermore, we design an attribute-sentence joint conditional generator learning scheme to align the feature embeddings of the multiple representations, which promotes cross-modal network training. Experimental results show that the proposed method obtains substantial improvements on the CUB (FID from 14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.
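A minimal sketch of the kind of conditioning described above, in which a sentence embedding queries a learnable attribute memory and the retrieved attribute embedding is concatenated as an extra condition for the generator; the memory size, width, and attention-style read are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical sketch of conditioning a generator on a sentence embedding plus
# an attribute memory. Memory size, embedding width, and the attention-based
# read are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeMemoryCondition(nn.Module):
    def __init__(self, num_slots=64, dim=256):
        super().__init__()
        # Learnable memory of attribute embeddings, shared across samples.
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.query = nn.Linear(dim, dim)

    def forward(self, sent_emb):                      # sent_emb: (B, dim)
        q = self.query(sent_emb)                      # query the attribute memory
        attn = F.softmax(q @ self.memory.t() / q.shape[-1] ** 0.5, dim=-1)
        attr_emb = attn @ self.memory                 # (B, dim) retrieved attributes
        return torch.cat([sent_emb, attr_emb], dim=-1)  # joint condition

cond = AttributeMemoryCondition()(torch.randn(4, 256))
print(cond.shape)  # torch.Size([4, 512])
```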
Randomized smoothing has achieved great success in certifying robustness against adversarial perturbations. Given any arbitrary classifier, randomized smoothing can guarantee the classifier's prediction on perturbed inputs with provable robustness by injecting noise into the classifier. However, all existing methods rely on a fixed i.i.d. probability distribution to generate noise for all dimensions of the data (e.g., all pixels of an image), which ignores the heterogeneity of inputs and data dimensions. Consequently, existing randomized smoothing methods cannot provide optimal protection for all inputs. To address this limitation, we propose the first anisotropic randomized smoothing method, which ensures provable robustness guarantees based on pixel-wise noise distributions. Furthermore, we design a novel CNN-based noise generator to efficiently fine-tune the pixel-wise noise distributions over all pixels of each input. Experimental results show that our method significantly outperforms state-of-the-art randomized smoothing methods.
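The sketch below illustrates the prediction side of such anisotropic smoothing, assuming a noise-generator network that outputs one scale per pixel; the sample count, shapes, and plain majority vote are illustrative, and a real certificate additionally requires the statistical test of the certification procedure.

```python
# Minimal sketch of anisotropic randomized smoothing at prediction time: noise
# is drawn with a per-pixel scale produced by a generator network, instead of a
# single fixed sigma for every dimension. Shapes, sample count, and the plain
# majority vote are illustrative assumptions.
import torch

@torch.no_grad()
def smoothed_predict(classifier, noise_generator, x, n_samples=1000, batch=100):
    """x: (1, C, H, W). Returns the majority-vote class under per-pixel noise."""
    sigma = noise_generator(x)                 # (1, C, H, W) positive noise scales
    votes = None
    for _ in range(n_samples // batch):
        noise = torch.randn(batch, *x.shape[1:], device=x.device) * sigma
        logits = classifier(x + noise)         # x broadcasts over the noise batch
        counts = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
        votes = counts if votes is None else votes + counts
    return int(votes.argmax())
```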
We study the certified robustness of machine learning classifiers against adversarial perturbations. In particular, we propose the first universally approximated certified robustness (UniCR) framework, which can approximate the robustness certification of any input on any classifier against any $\ell_p$ perturbation, with noise generated from any continuous probability distribution. Compared with state-of-the-art certified defenses, UniCR provides several important benefits: (1) it is the first universal robustness certification framework covering the above four "any"s; (2) it performs automatic robustness certification that avoids case-by-case analysis; (3) it validates the tightness of the certified robustness; and (4) it validates the optimality of the noise distributions used in randomized smoothing. We conduct extensive experiments to validate these benefits of UniCR and its advantages over state-of-the-art certified defenses against $\ell_p$ perturbations.
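As background for readers (this is the classical Gaussian randomized-smoothing certificate of Cohen et al., not UniCR's distribution-agnostic formula): if the smoothed classifier returns class $A$ with probability at least $p_A$ and the runner-up class with probability at most $p_B$ under noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, its prediction is provably constant within the $\ell_2$ radius $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, where $\Phi^{-1}$ is the inverse standard normal CDF. Frameworks of the kind described above aim to approximate such certified radii automatically for arbitrary noise distributions and $\ell_p$ norms rather than deriving them case by case.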
Neural vocoders based on generative adversarial networks (GANs) are widely used because of their fast inference speed and lightweight networks while generating high-quality speech waveforms. Because perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based neural vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. Such multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments we observed that multi-scale analysis that emphasizes the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the quality of the synthesized speech waveforms. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based neural vocoders, and propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate waveforms from different perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band waveforms while avoiding aliasing. Experimental results show that Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech. In particular, Avocodo can even reproduce high-quality waveforms of unseen speakers.
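As a rough illustration of what a sub-band discriminator consumes and produces: it scores a stack of band-decomposed waveforms (e.g., the output of a pseudo-QMF analysis bank). The layer widths and kernel sizes below are assumptions, not Avocodo's configuration, and the filter bank itself is represented only by the shape of its output.

```python
# Illustrative sketch of a sub-band discriminator over band-split waveforms.
# Layer sizes are assumptions for illustration, not Avocodo's exact design.
import torch
import torch.nn as nn

class SubBandDiscriminator(nn.Module):
    def __init__(self, n_bands=4, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bands, channels, kernel_size=15, stride=1, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=41, stride=4, padding=20),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),  # per-frame real/fake score
        )

    def forward(self, subband_wave):        # (B, n_bands, T) band-decomposed waveform
        return self.net(subband_wave)

scores = SubBandDiscriminator()(torch.randn(2, 4, 8192))
print(scores.shape)  # torch.Size([2, 1, 2048])
```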
Video frame interpolation (VFI) is the task of synthesizing the intermediate frame given two consecutive frames. Most previous studies have focused on appropriate frame-warping operations and refinement modules for the warped frames. These studies have been conducted on natural videos containing only continuous motions. However, many practical videos contain various unnatural objects with discontinuous motions, such as logos, user interfaces, and subtitles. We propose three techniques to make existing deep-learning-based VFI architectures robust to these elements. The first is a novel data augmentation strategy called figure-text mixing (FTM), which makes models learn discontinuous motions during training without any extra dataset. Second, we propose a simple but effective module that predicts a map called the discontinuity map (D-map), which densely distinguishes between areas of continuous and discontinuous motion. Lastly, we propose loss functions that supervise the discontinuous motion areas and can be applied together with FTM and the D-map. We additionally collect a special test benchmark called the Graphical Discontinuous Motion (GDM) dataset, consisting of mobile-game and chatting videos. Applied to various state-of-the-art VFI networks, our method significantly improves interpolation quality not only on the GDM dataset, but also on existing benchmarks containing only continuous motions, such as Vimeo90K, UCF101, and DAVIS.
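A hypothetical sketch of an FTM-style augmentation, in which an overlay patch is pasted onto a frame triplet and occasionally jumps between frames to simulate discontinuous motion; the patch size, jump probability, and random-noise overlay are stand-ins rather than the paper's exact recipe.

```python
# Hypothetical figure/text-mixing style augmentation for VFI training: paste an
# overlay that stays fixed or jumps across frames so the model sees
# discontinuous motion. All parameters here are illustrative assumptions.
import torch

def mix_discontinuous_overlay(frames, patch=32, p_jump=0.5):
    """frames: list of (C, H, W) tensors in temporal order. Returns augmented copies."""
    c, h, w = frames[0].shape
    overlay = torch.rand(c, patch, patch)          # stand-in for a logo / text crop
    y = torch.randint(0, h - patch + 1, ()).item()
    x = torch.randint(0, w - patch + 1, ()).item()
    out = []
    for i, f in enumerate(frames):
        if i > 0 and torch.rand(()) < p_jump:      # overlay jumps discontinuously
            y = torch.randint(0, h - patch + 1, ()).item()
            x = torch.randint(0, w - patch + 1, ()).item()
        g = f.clone()
        g[:, y:y + patch, x:x + patch] = overlay
        out.append(g)
    return out

aug = mix_discontinuous_overlay([torch.rand(3, 128, 128) for _ in range(3)])
print(len(aug), aug[0].shape)  # 3 torch.Size([3, 128, 128])
```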
Recent neural-network-based studies for improving image compression performance can be divided into three categories: learned codecs, post-processing networks, and compact representation networks. Learned codecs are designed for end-to-end learning beyond the traditional compression modules. Post-processing networks increase the quality of decoded images using example-based learning. Compact representation networks are learned to reduce the capacity of the input image so as to reduce the bit rate while maintaining the quality of the decoded image. However, these approaches are either incompatible with existing codecs or do not increase coding efficiency optimally. In particular, optimal learning is difficult to achieve in previous studies because of inaccurate gradients through the codec. In this paper, we propose a novel standard-compatible image compression framework based on auxiliary codec networks (ACNs). ACNs are designed to imitate the image degradation operations of an existing codec, which provides more accurate gradients to the compact representation network. Therefore, the compact representation and post-processing networks can be learned effectively and optimally. We demonstrate that our proposed framework, built on the JPEG and High Efficiency Video Coding (HEVC) standards, substantially outperforms existing image compression algorithms in a standard-compatible manner.
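The following sketch shows one way the described idea can be wired up: the real codec produces the forward values while an auxiliary network that imitates its degradation supplies the gradients (a straight-through-style substitution). The JPEG round trip, network interfaces, and losses are assumptions for illustration, not the paper's exact formulation.

```python
# Conceptual sketch: train a compact-representation network through a
# non-differentiable standard codec by using an auxiliary codec network (ACN)
# as a differentiable proxy for the codec's degradation. Details are
# illustrative assumptions, not the paper's exact training scheme.
import io
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def jpeg_round_trip(img, quality=30):
    """img: (3, H, W) float in [0, 1]. Encode/decode with the real, non-differentiable codec."""
    buf = io.BytesIO()
    TF.to_pil_image(img.detach().clamp(0, 1).cpu()).save(buf, format="JPEG", quality=quality)
    return TF.to_tensor(Image.open(buf))

def train_step(compact_net, acn, post_net, x, optimizer):
    y = compact_net(x)                              # image-like compact representation
    with torch.no_grad():
        decoded = torch.stack([jpeg_round_trip(img) for img in y]).to(x.device)
    proxy = acn(y)                                  # ACN imitates the codec degradation
    # Straight-through substitution: values come from the real codec,
    # gradients flow through the ACN proxy.
    decoded_st = proxy + (decoded - proxy).detach()
    restored = post_net(decoded_st)                 # post-processing network
    loss = torch.nn.functional.mse_loss(restored, x) \
         + torch.nn.functional.mse_loss(proxy, decoded)   # keep the ACN faithful to the codec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```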
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS-token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
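A hedged sketch of what token-relation distillation can look like: both networks' token features are turned into row-normalized token-affinity maps, and the student's map is pulled toward the teacher's with a KL term. The temperature and this particular loss are illustrative assumptions rather than TinyMIM's exact objective.

```python
# Sketch of token-relation distillation: match token-to-token relation maps
# (softmax-normalized similarities) between student and teacher, taken from a
# chosen teacher layer. Temperature and KL form are illustrative assumptions.
import torch
import torch.nn.functional as F

def token_relation(tokens, tau=1.0):
    """tokens: (B, N, D). Returns (B, N, N) row-normalized token-affinity maps."""
    tokens = F.normalize(tokens, dim=-1)
    return F.softmax(tokens @ tokens.transpose(1, 2) / tau, dim=-1)

def relation_distill_loss(student_tokens, teacher_tokens, tau=1.0):
    s = torch.log(token_relation(student_tokens, tau) + 1e-8)
    t = token_relation(teacher_tokens, tau)
    return F.kl_div(s, t, reduction="batchmean")

# Feature widths may differ between student and teacher; the relation maps do not.
loss = relation_distill_loss(torch.randn(2, 196, 192), torch.randn(2, 196, 768))
print(loss.item())
```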
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
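A rough sketch of a style-adaptive feed-forward layer in this spirit, where a style code predicts scale and shift parameters that modulate the block's hidden activations; the modulation scheme and sizes are assumptions, not StyleTalk's exact operator.

```python
# Sketch of a style-adaptive transformer feed-forward block: a style code
# predicts scale/shift parameters that modulate the hidden activations.
# Dimensions and the modulation form are illustrative assumptions.
import torch
import torch.nn as nn

class StyleAdaptiveFeedForward(nn.Module):
    def __init__(self, dim=256, hidden=1024, style_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.to_scale_shift = nn.Linear(style_dim, 2 * hidden)  # style -> (gamma, beta)

    def forward(self, tokens, style_code):       # tokens: (B, N, dim), style_code: (B, style_dim)
        gamma, beta = self.to_scale_shift(style_code).chunk(2, dim=-1)
        h = self.fc1(tokens)
        h = h * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)     # style-conditioned modulation
        return self.fc2(torch.relu(h))

out = StyleAdaptiveFeedForward()(torch.randn(2, 64, 256), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 64, 256])
```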